Abstract and Introduction
They introduce a new vector space representation where antonyms lie on opposite sides of a sphere: in the word vector space, synonyms have cosine similarities close to one, while antonyms are close to minus one.
This representation is derived with the aid of a thesaurus and LSA. Each entry in the thesaurus - a word sense along with its synonyms and antonyms - is treated as a “document”, and the resulting document collection is subjected to LSA. The key contribution of this work is to show how to assign signs to the entries in the co-occurrence matrix on which LSA operates, so as to induce a subspace with the desired property.
Further improvements result from refining the subspace representation with discriminative training and from augmenting the training data with general newspaper text.
They aim to solve the problem that LSA might assign a high degree of similarity to opposites as well as to synonyms.
- Able to discover new synonyms and antonyms
- The representation provides a natural starting point for gradient-descent based optimization
- It is straightforward to embed new words into the derived subspace by using information from a large unsupervised text corpus such as Wikipedia.
Related Work
The detection of antonymy has been studied in a number of previous papers. These methods consist of two main steps:
- Detecting contrasting word categories
- Determining the degree of antonymy
The automatic detection of synonyms has been studied more extensively.
Latent Semantic Analysis
This is a widely used method for representing words and documents in a low-dimensional vector space. The method applies singular value decomposition (SVD) to a matrix $W$ that records the occurrence of words in documents. A $d \times n$ document-term matrix $W$ is formed, whose $ij^{th}$ entry can be a raw term frequency or a TF-IDF weight. The similarity between two documents can be computed as the cosine similarity of their corresponding row vectors, and the similarity between two words as the cosine similarity of their corresponding column vectors. Finally, to obtain a subspace representation of dimension $k$, $W$ is decomposed as
$W \approx USV^T$
where $U$ is $d \times k$, $V^T$ is $k \times n$, and $S$ is a $k \times k$ diagonal matrix. In applications, $k \ll n$ and $k \ll d$.
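As an illustration, here is a minimal sketch of this rank-$k$ decomposition with NumPy/SciPy; the toy matrix, its values, and the choice of $k$ are made-up assumptions:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy d x n document-term matrix (entries could be term frequencies or TF-IDF weights).
W = np.array([
    [0.0, 1.2, 0.7, 0.0],
    [0.9, 0.0, 0.4, 1.1],
    [0.5, 0.8, 0.0, 0.3],
])
k = 2  # target subspace dimension, k << min(d, n)

# Rank-k SVD: W ~= U S V^T with U (d x k), S (k x k diagonal), V^T (k x n).
U, s, Vt = svds(W, k=k)
S = np.diag(s)

# Word vectors in the reduced space: columns of S V^T (one k-dim vector per word).
word_vecs = (S @ Vt).T

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vecs[0], word_vecs[1]))  # similarity of words 0 and 1
```

In the reduced space, documents can analogously be compared through the rows of $US$.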
Limitation of LSA
LSA also confuses antonyms with synonyms. This paper wants the least-similar words to a given word to be its opposites, rather than words with no relationship at all.
Polarity Inducing LSA
Words with opposite meanings lie at opposite ends of the same axis, so their cosine similarity is close to minus one.
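Here is a minimal sketch of the sign-assignment idea that yields this behavior, using a made-up two-entry thesaurus; the words, counts, and the choice $k=1$ are purely illustrative:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Each thesaurus entry is a "document": a word sense with its synonyms and antonyms.
# Synonym counts keep a positive sign; antonym counts are negated before applying LSA.
vocab = ["acrimony", "rancor", "goodwill", "affection"]
entries = [
    {"synonyms": ["acrimony", "rancor"],    "antonyms": ["goodwill"]},
    {"synonyms": ["goodwill", "affection"], "antonyms": ["acrimony"]},
]

W = np.zeros((len(entries), len(vocab)))  # entry-by-word matrix
for i, e in enumerate(entries):
    for w in e["synonyms"]:
        W[i, vocab.index(w)] += 1.0       # positive count for synonyms
    for w in e["antonyms"]:
        W[i, vocab.index(w)] -= 1.0       # negative count for antonyms

U, s, Vt = svds(W, k=1)                   # tiny example, so k = 1
vecs = (np.diag(s) @ Vt).T                # word vectors in the induced space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vecs[vocab.index("acrimony")], vecs[vocab.index("rancor")]))    # close to +1
print(cosine(vecs[vocab.index("acrimony")], vecs[vocab.index("goodwill")]))  # close to -1
```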
A limitation is that the vocabulary is not big enough, since the thesaurus may not contain all the words of interest. To handle unknown words, lexical analysis is first conducted to try to match an unknown word to one or more in-thesaurus words in their lemmatized forms. If no such match can be found, they attempt to find semantically related in-thesaurus words by leveraging co-occurrence statistics from general text data.
Discriminative Training
They propose a discriminative training method to obtain a model that can be written as a matrix $A$: the network transforms a $d \times 1$ vector $f$ into a $k \times 1$ vector $g = Af$.
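As a rough sketch of what such discriminative refinement could look like, the snippet below runs gradient descent on a squared cosine-similarity loss over labeled synonym/antonym pairs; the loss, the random data, and the plain-SGD optimizer are illustrative assumptions rather than the paper's exact training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 10                            # raw and projected dimensions (illustrative)
A = rng.normal(scale=0.1, size=(k, d))   # the model: g = A f

# Illustrative training pairs (f1, f2, target): target = +1 for synonyms, -1 for antonyms.
pairs = [(rng.normal(size=d), rng.normal(size=d), +1.0),
         (rng.normal(size=d), rng.normal(size=d), -1.0)]

lr = 0.01
for epoch in range(200):
    for f1, f2, t in pairs:
        g1, g2 = A @ f1, A @ f2
        n1, n2 = np.linalg.norm(g1), np.linalg.norm(g2)
        cos = g1 @ g2 / (n1 * n2)
        # Gradient of the loss (cos - t)^2 w.r.t. A, via the chain rule through g1 and g2.
        dcos_dg1 = g2 / (n1 * n2) - cos * g1 / n1**2
        dcos_dg2 = g1 / (n1 * n2) - cos * g2 / n2**2
        grad = 2 * (cos - t) * (np.outer(dcos_dg1, f1) + np.outer(dcos_dg2, f2))
        A -= lr * grad
```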
Matching via Lexical Analysis
When a target word is not included in a thesaurus, it is often the case that some of its morphological variants are covered.
First, a morphological analyzer for English developed by Minnen et al. is applied. If it produces no match, Porter’s stemmer is applied. If both fail, removing hyphens is considered.
If more than one word matches, the centroid of their PILSA vectors is used to represent the target word.
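A sketch of this fallback chain is given below; NLTK's WordNet lemmatizer stands in for Minnen's analyzer, and `thesaurus_vectors` is a hypothetical dictionary mapping in-thesaurus words to their PILSA vectors:

```python
import numpy as np
from nltk.stem import WordNetLemmatizer, PorterStemmer  # WordNet data: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()   # stand-in for Minnen's morphological analyzer
stemmer = PorterStemmer()

def lexical_match(word, thesaurus_vectors):
    """Try to map an out-of-thesaurus word onto in-thesaurus PILSA vectors."""
    stages = [
        [lemmatizer.lemmatize(word)],   # 1) morphological analysis
        [stemmer.stem(word)],           # 2) Porter stemming
        [word.replace("-", "")],        # 3) hyphen removal
    ]
    for candidates in stages:
        matched = [thesaurus_vectors[c] for c in candidates if c in thesaurus_vectors]
        if matched:
            # If more than one word matches, use the centroid of their PILSA vectors.
            return np.mean(matched, axis=0)
    return None   # no lexical match; fall back to the context-vector method below
```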
Leveraging General Text Data
If no words in the thesaurus can be linked to the target word through the simple lexical analysis procedure, they try to find matched words by creating a context vector space model from a large document collection, and then mapping from this space to the PILSA space.
Context Vector Space Model
For each target word, a bag of words is created by collecting all the terms within a window of [-10, +10] centered at each occurrence of the target word in the corpus. A context-word matrix is built from the TF-IDF values of these terms, and LSA is then performed on it.
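A sketch of how such a context-word matrix could be built from a tokenized corpus with a ±10 window, TF-IDF weighting, and a truncated SVD follows; the function name, the treatment of each target word's context bag as a "document" for the IDF term, and the add-one smoothing are assumptions:

```python
import numpy as np
from scipy.sparse.linalg import svds

def build_context_vectors(docs, vocab, window=10, k=2):
    """docs: list of token lists; vocab: list of target/context words; k < len(vocab)."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))   # target word x context word
    for tokens in docs:
        for pos, w in enumerate(tokens):
            if w not in index:
                continue
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            for c in tokens[lo:pos] + tokens[pos + 1:hi]:
                if c in index:
                    counts[index[w], index[c]] += 1

    # TF-IDF weighting: each target word's bag of context terms plays the role of a document.
    df = np.count_nonzero(counts, axis=0) + 1
    tfidf = counts * np.log(len(vocab) / df)

    # LSA on the weighted matrix: one k-dimensional context vector per target word.
    U, s, Vt = svds(tfidf, k=k)
    return U @ np.diag(s)
```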
Embedding Out-of-Vocabulary Words
A revised k-nearest-neighbors approach is proposed. Given an out-of-thesaurus word $w$, its K nearest in-thesaurus neighbors are found in the context space. A subset of k of these K words is then selected such that the pairwise similarity between every pair of the k members is positive. The thesaurus-space centroid of these k words is used as $w$'s representation.
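A sketch of this embedding step is shown below; `context_vecs` and `pilsa_vecs` are hypothetical dictionaries of context-space and thesaurus-space vectors, and the all-positive-pairwise-similarity subset is found with a simple greedy approximation:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def embed_oov(word, context_vecs, pilsa_vecs, K=10):
    """Embed an out-of-thesaurus word into the PILSA space via its context-space neighbors."""
    # 1) Find the K nearest in-thesaurus neighbors in the context space.
    target = context_vecs[word]
    neighbors = sorted(
        (w for w in pilsa_vecs if w in context_vecs),
        key=lambda w: cosine(target, context_vecs[w]),
        reverse=True,
    )[:K]

    # 2) Greedily keep a subset whose pairwise PILSA similarities are all positive,
    #    so that synonyms and antonyms of the target are not averaged together.
    subset = []
    for w in neighbors:
        if all(cosine(pilsa_vecs[w], pilsa_vecs[u]) > 0 for u in subset):
            subset.append(w)

    # 3) The PILSA-space centroid of the selected words represents the new word.
    return np.mean([pilsa_vecs[w] for w in subset], axis=0)
```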